Abstract





On Field Programmable Gate Array (FPGA) platforms, the Finite Impulse Response (FIR) filter is one of the
important applications in Digital Signal Processing (DSP). The traditional FIR filter is designed using a
number of adders and multipliers, which enlarge the area of the filter architecture. In this research, an
Array Multiplier (AM) is used in the Processing Element (PE) to multiply the filter inputs by the
coefficients, and a Carry Increment Adder (CIA) is employed in the accumulator to add the outputs of the PE.
The proposed design is named the AM-CIA-FIR filter. Owing to the use of the AM and the CIA, the hardware
utilization of the proposed design is improved. The AM-CIA-FIR filter is implemented in Verilog using
Xilinx ISE on Virtex-4, Virtex-5, and Virtex-6 devices. The experimental results show that the AM-CIA-FIR
filter reduces FPGA utilization by 14.01% compared with the PSA-FIR filter design.

Keywords: Array Multiplier, Carry Increment Adder, Field Programmable Gate Array, Finite Impulse
Response, Processing Element





1. Introduction

FIR filtering is one of the basic steps in several DSP applications, such as wireless communication,
video processing, and image processing [1]. The FIR filter finds extensive application in mobile
communication systems because of its stability and linear-phase properties [2], [3], [4]. Over the past
decades, several efficient techniques and structures have been analysed and are widely known to the
academic and industrial communities [5]. Low-area, low-power FIR filters with different numbers of taps
are now important in DSP research for efficient signal processing applications [6], [7].

Moreover, structural adders are employed in the FIR filter, which makes it more expensive [8]. A
sensitivity-driven algorithm has been used in FIR filter design to quantify the contribution of each
non-zero digit of the coefficient set to the frequency response characteristic, with enhanced weight-two
sub-expressions; however, that algorithm increases the computational complexity [9]. The fixed FIR filter
has been designed using the Multiple Constant Multiplication (MCM) scheme. Deriving the flow graph for the
transpose-form block FIR filter reduces the register complexity, even though the mathematical analysis of
the MCM scheme is complex [10].

In this research, the AM-CIA-FIR filter is designed and evaluated in terms of the number of Lookup Tables
(LUTs), frequency, flip-flops, and slices. In the AM, the signs of all partial-product bits are positive,
which requires the complement of each multiplier and multiplicand bit. The AM is used in the High
Performance Multiplier (HPM) tree, which inherits regular and repeating multiplier structures such as the
Baugh-Wooley multiplier and the Wallace tree multiplier. Because the AM contains fewer logic elements, the
overall power consumption, area, and delay of the architecture improve. The AM also supports multiplication
of signed values. Moreover, the CIA reduces the propagation delay and performs the addition quickly: the
time needed to determine the carry bit is reduced, and the sum output does not have to wait for the carry.
With the help of the AM and the CIA, the FPGA performance of the AM-CIA-FIR filter improves compared with
the conventional FIR filter.

The rest of this paper is organized as follows. Section 2 surveys the literature on FIR filters. Section 3
explains the proposed architecture and its internal modules. Section 4 presents the setup and the results
of the existing and proposed AM-CIA-FIR designs, and Section 5 concludes the work.

2. Literature review 
Pramod Patali and Shahana Thottathikkulam Kassim [11] proposed two efficient FIR filter structures with
increased throughput and reduced latency and hardware complexity. Two modified Carry Select Adder (CSLA)
modules (linear CSLA and square-root CSLA) were obtained by combining the ideas of the improved carry-select
adder and the carry-skip adder. A Critical Path Delay (CPD) analysis was carried out for the two CSLA
modules, and the square-root CSLA achieved the minimum CPD, of about 0.23 ns, compared with the linear CSLA
module. The comparison showed that the CPD, power, Power-Delay Product (PDP), and Area-Delay Product (ADP)
of filter 2 were reduced by 71%, 38%, 82%, and 78%, respectively, relative to filter 1. Although filter 2
improved the delay, the area and power costs increased because the delay-efficient multiplier was suitable
only for the time-consuming path. Thiruvenkadam Krishnan et al. [12] designed a high-speed, area-efficient
RCA-based 2-D bypassing multiplier for FIR implementation. It eliminated the carry multiplexer that the
bypassing technique uses in every logic cell and worked on the divide-and-conquer principle to shorten the
delay. For example, a 4x4 multiplication was divided between two 4x2 bypassing multipliers, in which the
partial sum and carry outputs were computed simultaneously with reduced delay. A 4-tap FIR filter was
designed using the proposed technique and implemented with the Altera Quartus II tool on a Cyclone
EP1IC12F324C6 device. The results showed a 15% reduction in LUTs, a 15% reduction in power, and a 10%
increase in speed. However, the divide-and-conquer principle was not suitable for fast adders such as the
CSA.

Radha Rammohan S et al. [13] developed an approximate 4:2 compressor adder in a memoryless Distributed
Arithmetic (DA) based FIR filter architecture. The main emphasis of this design was to reduce the area and
power consumption for hearing-aid applications. The memoryless DA architecture was designed using the
compressor adder because the area of the ROM increases significantly with the filter order. The proposed
design was reconfigurable, and the filter coefficients could be changed at run time. It was synthesized
with the Synopsys ASIC Design Compiler in 90 nm technology and showed a minimum area (14445 µm²),
ADP (20011 µm² × ns), MSP (1.32 ns), MSF (648 MHz), and PDP (11.48 mW × ns). The drawback was that the
compressor produces only an approximate value rather than an accurate one, which affects the filtering
performance.

Samyuktha S and Chaitanya DL et al. [14] proposed an effective FIR filter using the multiplication
principle of Vedic mathematics and a ripple carry adder. Single Constant Multiplication (SCM) and Multiple
Constant Multiplication (MCM) are frequently used in FIR implementation, but time and efficiency conflict
when configuring an effective FIR filter. A Vedic multiplier was therefore used to resolve these conflicts,
and it halved the time lost. Even though the conflicts were resolved, cross-checking the results was
difficult, and identifying the Vedic mathematics classification was also a problem.

Shinwoong Park et al. [13-15] developed an analog FIR (AFIR) filter system designed for full bandwidth
utilization in communication by using split capacitive DACs. The split DACs acted as multiplier
coefficients controlled by 7-bit codes to provide high linearity over the full frequency range. The noise
and the effect of 5-channel time-interleaved operation were also analysed. The AFIR filter was implemented
in 32-nm SOI CMOS technology, achieved 11 dBm IIP over the frequency range with a 0.9 V supply, and had
better filtering performance. However, aliasing occurred for higher-order filters, together with increases
in noise level, power consumption (10.6 mW), and the intricacy of clock distribution.


The major problems of the FIR filter are listed below:

• Designing the FIR filter normally requires many logic blocks, which increases area and power.

• The filter operation takes more time because of unnecessary blocks.

• The coefficients and inputs are difficult to store in the allocated memory.

• A normal addition operation occupies a large area.


Solution:

To overcome the above problems, an efficient AM-CIA-FIR filter is designed to improve the performance of
the architecture. The AM performs the multiplication at high speed, and the CIA performs the addition with
less area. In this work, the adder is used in the accumulator module, so the addition process is especially
important; the CIA is therefore used instead of a normal adder [16, 17].


3. AM-CIA-FIR Filter Design

Multipliers and adders are the essential components of digital FIR filters, and these arithmetic circuits
account for most of the area consumption. Furthermore, a suitable multiplier enables a high-speed,
low-power, low-area, and compact VLSI implementation. The FIR filter in this research is therefore designed
using the AM and the CIA.


A. Proposed FIR Filter Design 

FIR filters are non-recursive filters [17] in which the output is the sum of input samples multiplied by
constants. The FIR filter operation is known as a convolution in DSP and is represented by Eq. (1).


$$y(n) = \sum_{i=0}^{N-1} d[i]\, x(n-i) \tag{1}$$
Here, N denotes the number of taps of the filter structure, y(n) is the filter output, d[i] is the i-th
coefficient of the length-N FIR filter, and x(n-i) is the input sequence. The block diagram of the
AM-CIA-FIR filter design is shown in Fig. 1. In the FIR filter, the filter coefficients are stored in the
ROM and the filter inputs are stored in the RAM. The address generator produces the data addresses, which
are used to read the filter inputs from the RAM and the coefficients from the ROM. The data reader module
supplies the input data, and the continuous operation proceeds as follows. First, the memory address of the
new data is computed; the data are then enabled to the RAM and stored at that address. The coefficients
stored in the ROM are processed together with the RAM data in the PE. The PE output is connected to the
accumulator block, which produces the FIR filter output. The AM is used in the PE module and the CIA in the
accumulator module to improve the performance of the architecture.


Fig.1 AM-CIA-FIR filter architecture.
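
The data flow described above (coefficients in ROM, input samples in RAM, an address generator, a PE, and an accumulator) can be modelled behaviourally. The following Python sketch is illustrative only and is not the authors' Verilog: the class name `FirDatapath`, the circular-buffer addressing, and the unbounded integer accumulator are assumptions made for the example.

```python
class FirDatapath:
    """Behavioural model of a ROM/RAM based FIR filter (hypothetical names)."""

    def __init__(self, coefficients):
        self.rom = list(coefficients)      # coefficient ROM, d[0] .. d[N-1]
        self.ram = [0] * len(self.rom)     # input-sample RAM, initially cleared
        self.write_addr = 0                # state of the address generator

    def step(self, sample):
        """Store one new input sample and return the filter output y(n)."""
        n_taps = len(self.rom)
        self.ram[self.write_addr] = sample           # write x(n) into the RAM
        acc = 0                                      # accumulator starts at zero
        for i in range(n_taps):
            read_addr = (self.write_addr - i) % n_taps   # address of x(n-i)
            product = self.rom[i] * self.ram[read_addr]  # PE: d[i] * x(n-i)
            acc += product                               # accumulator addition
        self.write_addr = (self.write_addr + 1) % n_taps
        return acc


if __name__ == "__main__":
    fir = FirDatapath([1, 2, 3, 4])
    # An impulse of height 5 reproduces the coefficients scaled by 5.
    print([fir.step(x) for x in [5, 0, 0, 0, 0]])
```

In this model each call to `step` corresponds to one evaluation of Eq. (1); a hardware implementation would spread the inner loop over clock cycles.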

B. Array Multiplier

In DSP applications, with the advances in current technology, design targets concentrate mainly on better
performance, and the critical path delay and performance of the processor depend on the multiplier block.
The array multiplier is a suitable, modest architecture because of its low design and time complexity and
because it performs fast multiplication in a pipelined manner. It is a digital combinational circuit that
multiplies two n-bit numbers using the add-and-shift algorithm. The step-by-step block diagram is shown in
Fig. 2.
[Figure: an N-bit multiplicand and an N-bit multiplier feed partial product generation, followed by partial
product summation (shift and add), which gives the N x N bit product.]

Fig.2. Block diagram of multiplier












An n x n array multiplier requires n*n AND gates, n*(n-2) Full Adders (FA), and n Half Adders (HA). The
general hierarchical structure of a 4x4 array multiplier is shown in Fig. 3. Consider A as the multiplicand
and B as the multiplier, which produce the product terms P as shown in Eq. (2).

$$P = A \times B \tag{2}$$

The first step in the design is partial product generation. Each bit of the multiplicand is ANDed with a
single bit of the multiplier to generate the n partial products (A_j · B_k). The partial products are
shifted according to their bit order and added. The product bits are then formed using the adders in each
column i = (j + k). Here, the adders are arranged in carry-save fashion, in which the carry-out bits are
fed to the next available adder in the column to the left. The final product is obtained from the final
adder, which improves delay and area. In terms of speed, the array multiplier, as a parallel multiplier,
outperforms the serial multiplication scheme.

A 4x4 array multiplier requires 16 AND gates, 8 full adders, and 4 half adders. Similarly, an 8x8 array
multiplier requires 64 AND gates, 48 full adders, and 8 half adders.
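
The gate counts quoted above can be checked with a small bit-level model. The Python sketch below is an illustration under stated assumptions, not the authors' circuit: it uses a ripple-row arrangement of the adders (each partial-product row is added to the running sum by a row of n adders), which is one arrangement that yields n*n AND gates, n*(n-2) full adders, and n half adders; the function names are hypothetical.

```python
def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def full_adder(a, b, c):
    """Return (sum, carry) for three input bits."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))


def array_multiply(a, b, n):
    """Multiply two unsigned n-bit values; return (product, gate counts)."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    a_bits = [(a >> j) & 1 for j in range(n)]
    b_bits = [(b >> k) & 1 for k in range(n)]
    # Partial products: pp[k][j] = A_j AND B_k, one AND gate each.
    pp = [[a_bits[j] & b_bits[k] for j in range(n)] for k in range(n)]
    counts = {"AND": n * n, "HA": 0, "FA": 0}

    product_bits = [pp[0][0]]          # least significant product bit
    running = pp[0][1:] + [None]       # None marks a position with no signal
    for k in range(1, n):
        carry = None                   # no carry into the first adder of a row
        row_sum = []
        for j in range(n):
            operands = [bit for bit in (pp[k][j], running[j], carry) if bit is not None]
            if len(operands) == 3:
                s, carry = full_adder(*operands)
                counts["FA"] += 1
            else:                      # exactly two operands: a half adder suffices
                s, carry = half_adder(*operands)
                counts["HA"] += 1
            row_sum.append(s)
        product_bits.append(row_sum[0])
        running = row_sum[1:] + [carry]
    product_bits.extend(running)       # upper n bits of the product
    product = sum(bit << i for i, bit in enumerate(product_bits))
    return product, counts


if __name__ == "__main__":
    print(array_multiply(13, 11, 4))
    print(array_multiply(36, 7, 8))
```

For n = 4 the model reports 16 AND gates, 4 half adders, and 8 full adders, and for n = 8 it reports 64, 8, and 48, matching the figures in the text.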


C. Carry Increment Adder

The adder is also one of the basic building blocks of a DSP processor, in which series of repeated
additions are performed as part of the multiplier operation. To speed up the multiplier operation, the
addition speed must therefore be increased. Fast adders are used because they perform faster than
conventional adders such as the RCA and the CLA. Among the fast adders, the Carry Increment Adder (CIA) has
better delay performance, an important attribute in high-speed devices.
Fig.3 Logical design of 4x4 array multiplier


The Carry Increment Adder (CIA) contains two essential blocks: an RCA and an incremental circuit. The
incremental circuit is built from half adders connected in a ripple-carry chain. The addition is carried
out by several RCAs, with the total number of bits split into groups of 4 bits. The Ripple Carry Adder
(RCA) is a cascade of full adders for an n-bit input sequence, so a carry is generated in each full adder
block. The carry output of the first stage ripples to the second-stage full adder as its carry input, and
the process continues up to the last stage, as shown in Fig. 4.














[Figure: cascaded full adders with inputs A and B, internal carries C1 to C3, carry output Cout, and sum
outputs S0 to S3.]

Fig.4 Logical diagram of RCA


In the carry-select scheme, by contrast, two partial sums are computed and the correct one is selected with
the help of multiplexers, which increases the area and affects the speed. In the CIA, the incremental
circuit replaces the second adder and the multiplexer block: only one partial sum is calculated, and it is
incremented if necessary. The block diagram of an 8-bit CIA is shown in Fig. 5.
Fig.5. Logical diagram of CIA
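
The operation of the CIA described above can be sketched at bit level. The Python model below is a hedged illustration rather than the authors' design: it assumes 4-bit groups, an RCA with zero carry-in in every group, and a half-adder chain that increments the partial sum of each group after the first by the incoming carry; the function names are hypothetical.

```python
def ripple_carry_add(a_bits, b_bits):
    """Add two equal-length bit lists (LSB first) with a full-adder chain."""
    sums, carry = [], 0
    for a, b in zip(a_bits, b_bits):
        sums.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return sums, carry


def increment(bits, carry_in):
    """Half-adder chain: add a single carry bit to a bit list (LSB first)."""
    out, carry = [], carry_in
    for bit in bits:
        out.append(bit ^ carry)   # half-adder sum
        carry = bit & carry       # half-adder carry
    return out, carry


def carry_increment_add(a, b, width=8, group=4):
    """Return (sum modulo 2**width, carry out) of two unsigned values."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    result_bits, carry = [], 0
    for start in range(0, width, group):
        partial, rca_carry = ripple_carry_add(
            a_bits[start:start + group], b_bits[start:start + group]
        )
        if start == 0:
            # The lowest group needs no incrementer.
            group_sum, carry = partial, rca_carry
        else:
            # Increment the partial sum by the carry of the previous group.
            group_sum, inc_carry = increment(partial, carry)
            carry = rca_carry | inc_carry
        result_bits.extend(group_sum)
    total = sum(bit << i for i, bit in enumerate(result_bits))
    return total, carry


if __name__ == "__main__":
    print(carry_increment_add(200, 100))
```

Because every group computes its partial sum without waiting for the carry from the lower groups, only the short incrementer chain lies on the carry path between groups. The group carry is the OR of the RCA carry and the incrementer carry, since the two cannot both be 1 for the same group.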

D. DSP applications

In DSP applications, filtering is one of the important processes for eliminating unwanted information
present in the input data, and it can be implemented using the proposed design. Filtering is normally used
to remove noise (salt-and-pepper or flicker noise) from images and from biomedical signals such as the
Electrocardiogram (ECG), Electroencephalogram (EEG), and Electromyogram (EMG). With the AM and CIA
architecture, the FIR filter can reduce the noise present in images or signals. For noise reduction, the
noisy input images or noisy biomedical signals are read in MATLAB and converted into binary values, which
are stored in the ROM for processing by the PE. The proposed FIR architecture removes the noisy pixel
values and generates the noise-free image or signal at the output terminal. The proposed FIR filter
architecture can therefore also be used in DSP applications.

4. Results and Discussion
The proposed AM-CIA-FIR filter is designed and analyzed on different FPGA devices. The architecture is
simulated in Modelsim 10.5 to verify the output waveform, and Xilinx ISE 14.7 is used to obtain the FPGA
performance figures.

In this research, the AM-CIA-FIR and the existing FIR filter designs are implemented in the Xilinx tool,
and the results are tabulated in Table 1. The number of LUTs, slices, and flip-flops, the frequency, and
the Input-Output Blocks (IOBs) are analyzed through the FPGA implementation. The performance of the
AM-CIA-FIR filter is analyzed on Virtex-4 xc4vfx12, Virtex-5 xc5vlx20t, and Virtex-6 xc6vcx75t. In Table 1,
Virtex-6 gives the best results compared with Virtex-4 and Virtex-5. On Virtex-6, the design reduces the
LUTs by 3.81%, the flip-flops by 9.9%, and the slices by 29.81% compared with the PSA-FIR filter design. A
power of 0.166 W is consumed at a frequency of 247.78 MHz.


[Figure: bar chart of LUT counts across different taps for PSA-FIR [16] and AM-CIA-FIR.]

Fig.6. Comparison performance of LUT for PSA-FIR [16] and AM-CIA-FIR filters on Virtex-6.
Fig.7. Comparison performance of flip-flop for PSA-FIR [16] and AM-CIA-FIR filters on Virtex-6.


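TABLE 1. EXPERIMENT RESULTS OF FPGA PERFORMANCE FOR EXISTING AND AM-CIA-FIR FILTER DESIGN

| Target FPGA | Methodology | Tap | LUT | Flip-flop | Slice | Frequency (MHz) | IOB |
|---|---|---|---|---|---|---|---|
| Virtex-4 xc4vfx12 | PSA-FIR [16] | 8-tap | 168/10944 | 99/10944 | 144/5472 | 240.5 | 26/240 |
| Virtex-4 xc4vfx12 | PSA-FIR [16] | 16-tap | 180/10944 | 100/10944 | 126/5472 | 240.5 | 26/240 |
| Virtex-4 xc4vfx12 | PSA-FIR [16] | 32-tap | 193/10944 | 101/10944 | 153/5472 | 240.5 | 26/240 |
| Virtex-4 xc4vfx12 | AM-CIA-FIR | 8-tap | 135/10944 | 71/10944 | 94/5472 | 253.04 | 25/240 |
| Virtex-4 xc4vfx12 | AM-CIA-FIR | 16-tap | 144/10944 | 71/10944 | 97/5472 | 254.21 | 25/240 |
| Virtex-4 xc4vfx12 | AM-CIA-FIR | 32-tap | 164/10944 | 65/10944 | 98/5472 | 245.61 | 25/240 |
| Virtex-5 xc5vlx20t | PSA-FIR [16] | 8-tap | 69/12480 | 21/12480 | 27/3120 | 174.99 | 26/172 |
| Virtex-5 xc5vlx20t | PSA-FIR [16] | 16-tap | 71/12480 | 22/12480 | 25/3120 | 174.99 | 26/172 |
| Virtex-5 xc5vlx20t | PSA-FIR [16] | 32-tap | 74/12480 | 23/12480 | 23/3120 | 174.99 | 26/172 |
| Virtex-5 xc5vlx20t | AM-CIA-FIR | 8-tap | 58/12480 | 18/12480 | 14/3120 | 191.22 | 25/172 |
| Virtex-5 xc5vlx20t | AM-CIA-FIR | 16-tap | 59/12480 | 17/12480 | 19/3120 | 193.25 | 25/172 |
| Virtex-5 xc5vlx20t | AM-CIA-FIR | 32-tap | 60/12480 | 15/12480 | 20/3120 | 188.47 | 25/172 |
| Virtex-6 xc6vcx75t | PSA-FIR [16] | 8-tap | 104/46560 | 80/93120 | 57/11640 | 247.39 | 26/240 |
| Virtex-6 xc6vcx75t | PSA-FIR [16] | 16-tap | 108/46560 | 81/93120 | 54/11640 | 247.39 | 26/240 |
| Virtex-6 xc6vcx75t | PSA-FIR [16] | 32-tap | 111/46560 | 83/93120 | 42/11640 | 247.39 | 26/240 |
| Virtex-6 xc6vcx75t | AM-CIA-FIR | 8-tap | 95/46560 | 68/93120 | 35/11640 | 247.78 | 25/240 |
| Virtex-6 xc6vcx75t | AM-CIA-FIR | 16-tap | 96/46560 | 72/93120 | 45/11640 | 247.78 | 25/240 |
| Virtex-6 xc6vcx75t | AM-CIA-FIR | 32-tap | 98/46560 | 73/93120 | 34/11640 | 247.78 | 25/240 |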











The filter is designed with different numbers of taps: 8, 16, and 32. Figs. 6, 7, and 8 compare the LUT,
flip-flop, and slice performance of the PSA-FIR [16] and AM-CIA-FIR filters on Virtex-6. These graphs
clearly show that the proposed method has better FPGA performance than the conventional design. Fig. 9
shows the FIR filter results, which contain the control module and the operational block. The data-out
values from the RAM are randomly generated and are multiplied by the coefficients. In the AM-CIA-FIR
filter, the AM performs the multiplication in the PE. For example, the output in the initial clock cycle is
252, the product of 36 and 7. This output is stored in the y register, which initially contains zero. After
the initial clock cycle, each new product is added to y and stored back in the same y register. Finally,
the accumulator produces the FIR filter outputs. The number of output bits depends on the number of input
bits: the AM-CIA-FIR filter is designed for an 8-bit input and an 8-bit coefficient and gives a 16-bit
filter output. In this research, the FIR filter is designed using a 16-bit AM and a 16-bit CIA.
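
The accumulation into the y register described above can be traced cycle by cycle. The Python sketch below is a minimal illustration: only the first pair of values (36 and 7, giving 252) comes from the text, while the remaining samples and coefficients, the function name `accumulate`, and the wrap-around of the register at 16 bits are assumptions made for the example.

```python
def accumulate(samples, coefficients, width=16):
    """Return the contents of the y register after each clock cycle."""
    mask = (1 << width) - 1       # assumed: y wraps around at `width` bits
    y = 0                         # the y register initially contains zero
    trace = []
    for x, d in zip(samples, coefficients):
        product = x * d           # PE output for this clock cycle
        y = (y + product) & mask  # add to y and store back in the same register
        trace.append(y)
    return trace


if __name__ == "__main__":
    # First pair (36, 7) is taken from the text; the others are hypothetical.
    print(accumulate([36, 12, 5], [7, 3, 2]))
```

With these inputs the register holds 252 after the first cycle, then 288 and 298 as the later products are added.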


[Figure: bar chart of slice counts across different taps for PSA-FIR [16] and AM-CIA-FIR.]

Fig.8. Comparison performance of Slice for PSA-FIR [16] and AM-CIA-FIR filters on Virtex-6.





Fig.9. Waveform of the proposed architecture


5. Conclusion

In this work, the AM and the CIA have been used to enhance the FIR filter operation. The multiplier is used
in the processing element module and the adder in the accumulator. The proposed AM-CIA-FIR filter was
implemented on different FPGA devices; the FPGA has developed into a platform of choice for efficient and
fast realization of compute-intensive applications. The main aim of the proposed method is to mitigate the
area complexity of the product accumulation block using a 16-bit CIA. The proposed architecture was
designed for different numbers of taps: 8, 16, and 32. In the 8-tap Virtex-6 implementation, the LUTs, the
flip-flops, and the slices are reduced by 3.81%, 9.9%, and 29.81%, respectively, compared with the PSA-FIR
filter design. The proposed FIR filter is well suited to DSP applications because it requires less area.
In future work, the FIR architecture will be implemented with an optimal multiplier and an optimal adder to
further improve the FPGA performance.